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Abstract 

With the advent of high-throughput measurement techniques, scientists and engineers are starting to 
grapple with massive data sets and encountering challenges with how to organize, process and extract 
information into meaningful structures. Multidimensional spatio-temporal biological data sets such as 
time series gene expression with various perturbations over different cell lines, or neural spike trains 
across many experimental trials, have the potential to acquire insight across multiple dimensions. For 
this potential to be realized, we need a suitable representation to understand the data. Since a wide range 
of experiments and the unknown complexity of the underlying system contribute to the heterogeneity of 
biological data, we propose a method based on Robust Principal Component Analysis (RPCA), which 
is well suited for extracting principal components when there are corrupted observations. The proposed 
method provides us a new representation of these data sets in terms of a common and aberrant response. 
This representation might help users to acquire a new insight from data. 

Author Summary 

One of the most exciting trends and important themes in science and engineering involves the use of 
high-throughput measurement data. With different dimensions, for example, various perturbations, dif- 
ferent doses of drug or cell lines characteristics, such multidimensional data sets enable us to understand 
commonalities and differences across multiple dimensions. A general question is how to organize the ob- 
served data into meaningful structures and how to find an appropriate similarity measure. A natural way 
of viewing these complex high dimensional data sets is to examine and analyze the large-scale features 
and then to focus on the interesting details. With this notion, we propose an RPCA-based method which 
models common variations as approximately the low-rank component and anomalies as the sparse com- 
ponent. We show that the proposed method is able to find distinct subtypes and classify data sets in a 
robust way without any prior knowledge by separating these common responses and abnormal responses. 

Introduction 

Over the last years, the use of high-throughput measurement data has become one of the most exciting 
trends and important themes in science and engineering. This is becoming increasingly important in 
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biology. However, handling and analyzing biological data have challenges all of their own because the data 
sets are typically heterogeneous. Biological data can not only stem from a wide range of experiments such 
as inhibitions/stimulations, different doses of drugs, and various cell lines (Figure 1) but also represent 
the (unknown) complexity of the underlying system [1] . 

With the explosion of the amount of various biological data, a general question is how to organize 
the observed data into meaningful structures and how to find an appropriate similarity (or dissimilarity) 
measure which is critical to the analysis. Since such multidimensional data have the potential to provide 
insight across multiple dimensions, these data can enable users to start to develop models and draw 
hypotheses that not only describe the spatial and temporal dynamics of the biological system but also 
inform them about commonalities and differences across dimensions. A significant challenge for creating 
suitable representations is to continue handling large data sets and to effectively deal with the growing 
diversity and quantity of the data set. 

A natural way of viewing these complex high dimensional data sets is to examine and analyze the 
large-scale features and then to focus on the interesting details. The potential of clustering to reveal bio- 
logically meaningful patterns in microarray data was first realized and demonstrated in an early paper by 
Eisen et al [2]. Thereafter, in many biological applications, different methods have been used to analyze 
gene expression data and characterize gene functional behavior. Among various data-driven modeling 
approaches, clustering methods are widely used on these data to categorize genes with similar expression 
profiles. However, until recently, most studies have focused on the spatial, rather than temporal, struc- 
ture of data. For instance, neural models are usually concerned with processing static spatial patterns 
of intensities without regard to temporal information [3]. Since many existing data-driven modeling ap- 
proaches such as clustering, classification or inference using biological data focus on static data, they 
have limitations in analyzing multi-dimensional spatio-temporal data sets. 

Recently, much research has focused on time series high-throughput data sets. These data sets have 
the advantage of being able to identify dynamic relationships between genes or neurons since the spatio- 
temporal pattern results from integration of regulatory signals or electrochemical signals through the 
network over time. For example, time series gene-knockout experiment data sets provide the distinct 
possibility of observing the cellular mechanisms in action [4]. Also, these data sets help us to unravel 
the mechanistic drivers characterizing cellular response and to break down the genome into sets of genes 
involved in the related processes [5]. Moreover, instead of concentrating on steady state response, moni- 
toring dynamic patterns provides a profoundly different type of information. For instance, several recent 
studies focus on the temporal complexity and heterogeneity of single-neuron activity in the premotor and 
motor cortices [3] [6] [7]. Moreover, since many current and emerging cancer treatments are designed to 
inhibit or stimulate a specific node (or gene) in the networks and alter signaling cascades, advancing our 
understanding of how the system dynamics of these networks is deregulated across cancer cells and finding 
subgroups of genes and conditions will ultimately lead to the more effective treatment strategies [8] . 

In this paper, we propose the Robust Principal Component Analysis (RPCA)-based method for ana- 
lyzing spatio-temporal biological data sets. To demonstrate that our method helps users acquire insight 
efficiently and to emphasize that the proposed method can be applicable to various domains, we con- 
sider two different systems 1) neural population dynamics and 2) a gene regulatory network. Since the 
proposed method uses the common dynamic features in the spatio-temporal data set, it is important to 
arrange individual data sets in order to make them amenable to this analysis. 
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Background 

Motivation 

1) Neural Population Dynamics 

Neural ensemble activity is typically studied by averaging noisy spike trains across multiple experimental 
trials to obtain an approximate neural firing rate that varies smoothly over time. However, if neural 
activity is more a reflection of internal neural dynamics rather than response to external stimulus, the 
time series of neural activity may differ even when the subject is performing nominally identical tasks [7]. 
In [6], Churchland et al. showed that neural activity patterns in the primary motor cortex and dorsal 
premotor cortex of the macaque brain associated with nearly identical velocity profiles can be very 
different. This is particularly true of behavioral tasks involving perception, decision making, attention, 
or motor planning. In these settings, it is critical not to average the neural data across trials, but to 
analyze it on a trial-by-trial basis [3]. Moreover, stimulus representations in some sensory systems are 
characterized by the precise spike timing of a small number of neurons [10] [11] [12], suggesting that the 
details of operations in the brain are embedded not only in the overall neural spike rate, but also in the 
timings of spikes. 

The motor and premotor cortices have been extensively studied but their dynamic response properties 
are poorly understood [3]. Moreover, the role of motor cortex in arm movement control is still unclear, 
with experimental evidence supporting both low-level muscle control as well as high-level kinematic 
parameters. We can define the motor cortical activity, which represents movement parameters as per 
equation (1), and the dynamical system that generates movements as per equation (2) [3]: 

Xi(t) = /ii(param 1 (t),param 2 (t),param 3 (i), ...) (1) 
x(t) - /(x(t))+u(t) (2) 

where Xi(t) is the firing rate of neuron i at time t, hi is its tuning function, and each parang may represent 
a movement parameter such as hand velocity, target position or direction. In (2), x £ K™ is a vector 
describing the firing rate of all neurons where n is the number of neurons, x is its derivative, / is an 
unknown function, and u is an external input. In (2), neural activity is governed by the underlying dy- 
namics /(•), so the characteristics of dynamical system should be present in the population activity. Also, 
if we align spatio-temporal neural activity as shown in Figure 2(b), we may extract these characteristics. 



2) Gene Regulatory Network 

In microarray data, missing and corrupted data, including arbitrary corruptions by human error during 
biological experiments, are quite common, and not uniform across samples. Two strategies for dealing 
with missing values are either to modify clustering methods so that they can deal with missing values, 
or impute a "complete" data set before clustering [13]. 

Consider collections of time series gene expression of breast cancer cell lines or microarray data sets 
from pathway-targeted therapies involving gene knockout experiments. When a specific gene is perturbed 
as shown in Figure 2(c), the broad gene expression levels of other genes might be perturbed over time. 
Thus, comparing gene expression levels in the perturbed system with those in the unperturbed system 
reveals the extra information that is the different cellular mechanisms in action. A dynamical system of 
the gene regulatory network can be modelled as follows: 

. , . _ J /(x(t)) (w/o perturbation or wild-type) , . 

|_ /(x(<)) +#r.i(x(t)) (perturbed / mutant-specific part) 

where x(t) £ W 1 denotes the concentrations of the rate-limiting species, x(£) represents the change in 
concentration of the species over time t, n is the number of genes, /(•) represents the vector field of 
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the typical dynamical system (or wild- type) and <?{•}(•) represents an additional perturbation or mutant- 
specific vector field (blue and red edges in Figure 2(c)). In other words, we have a unified model for 
wild-type cell line, x(i) = f(pc(t)) and in the mutant or perturbation case, we invoke a single change 
to the network topology or add a single influence for a specific gene. Here, additional vector fields 
such as giAp('), ffAkti(') and <7m(') are assumed to be sparse (i.e., affect only a single gene expression). 
Although these additional vector fields affect only a single gene expression at time t, their influence can 
be propagated through the network over time. 



Robust Principal Component Analysis (RPCA) 

In the computer vision literature [9] , an interesting separation problem is introduced where the observed 
data matrix can be decomposed into an unseen low-rank component and an unseen sparse component. 
The method called Robust Principal Component Analysis (RPCA) is a provably correct and efficient 
algorithm for the recovery of low-dimensional linear structure from non-ideal observations, incorporating 
for example, occlusions, malicious tampering, and sensor failures. 

In video surveillance, we need to identify activities that stand out from the background given a 
sequence of video frames [9]. Figure 2(a) shows that if we stack the video frames as rows of a matrix 
M £ M. qxPx ' Py where q is the number of frames for a given time window, and P x and P y represent 
the number of pixels of 2-D images respectively, then across each row of M, there exists a common 
component that is the stationary background and a changing component which are the moving objects in 
the foreground at each image frame. Here, the data matrix M is an input for RPCA and the output is both 
the stationary background represented as a matrix L £ R9 xf, *' p a and the moving objects in the foreground 
represented as a matrix S £ M, qxPm ' Py . With only one frame, the moving objects cannot be identified 
from the stationary background. However, by stacking all the vectorized frames such that all the frames 
align across the column direction as shown in Figure 2(a), we can identify the stationary backgrounds 
which are common variations, and then capture the moving objects which are sparse components for each 
frame. 

With this notion, suppose we are given a large data matrix M, which has principal components in 
the low- rank component and may contain some anomalies in the sparse component. Mathematically, it is 
natural to model the common variations as approximately the low-rank component L, and the anomaly 
as the sparse component S. In [9], Candes et al. formulate this as follows: 

minllLL+AllSllj s.t. M = L + S (4) 

where || L || ^ denotes the so-called nuclear norm of the matrix L, which is the sum of the singular value 
of L, and || S|] = represents Zi-norm of S. The turning parameter A may be varied to put 

more importance on the rank of L or the sparseness of S. Choosing the tuning parameter A to be 
A = 1/ y // max(q, P x ■ P y ), works well in practice [9]. However, appropriate choice of A remains an open 
problem. Thus, we can use A as a turning parameter to trade off more importance between L and S. 



How to Construct the Data Matrix M 

In the video surveillance example shown in Figure 2(a), each row of M represents the vectorized 2-D 
images at each time frame. Since each image consists of the stationary background (Li >: ) and the moving 
objects in the foreground (Sj, : ) at each time i, we denote M as follows: 





Mi . 




Ll,: 




Si, 


M = 






L 2>: 


+ 


Sa,: 




M 9 , : _ 











= L + S 



(5) 
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where M,, : represents the z-th row of M. If there were no moving object in the foreground and no 
variation for a given video sequence (i.e., Vi,Sj ): = 0), L i : (= Lj j: (z 7^ j)) would represent the common 
stationary background. On the other hand, if not (i.e., Si t - ^ 0), M represents the aligned corrupted 
measurements M^.. Although the measurements are corrupted by moving objects in the foreground, we 
are able to separate L and S under certain conditions [9]. 

1) Neural Population Dynamics 

Recall equation (2) and consider an experiment involving a non-human primate subject instructed to 
make visually-guided planar reaches with its hand. During the experiment, hand position and velocity, 
as well as the discharge of neurons from primary motor cortex and dorsal premotor cortex were recorded. 
See recerence [14] for details on the data sets. All procedures were conducted in compliance with the 
National Institute of Health Guide for Care and Use of Laboratory Animals and were approved by the 
University of California, Berkeley Institutional Animal Care and Use Committee. Then, hand velocity 
data were decomposed into a sum of minimum-jerk basis functions where submovement representation 
is a type of motor primitive; for example, the hand speed profile as a function of time resulting from 
arm movements can be represented by a sum of bell-shaped functions as shown in Figure 2(b), each of 
which is called a submovement [14] and denoted as different trials. In Figure 2(b), each red bar denotes 
submovement onset, i.e., when the subject triggers submovement. 

Suppose we align the spatio-temporal neural activity x 4 [<] = [x l (io), x 4 (ti), x*(iAr T _i)] e jgj lxN T 
governed by (2) with submovement onset where the superscript i represents the z-th trial and Nj> repre- 
sents the number of time points for the chosen time window. Then, M may be represented as follows: 





'x{[t] x\[t] . 


■ 4M~ 








M = 


x\[t\ 4[t] . 






■ 4]4x 


(6) 




xiit] xik '. 


■ <%[t]_ 









where Xi = [ejx 1 [t};ejx 2 [t}; ...;ejx q [t]\ £ Ri xN t represents the temporal neural activity of the z-th 
neuron, e.; G K ra is a unit vector, and q is the number of trials or submovements. Thus, each row 
of M represents the vectorized spatio-temporal neural response for the each trial. Note that we align 
each spatio-temporal data set x J '[£] with the same temporal condition (submovement onset) as shown in 
Figure 2(b) but we do not separate different types of submovement. For example, submovements with 
different reach directions, or with different ordinal positions in an overlapped series of submovements, 
are combined in our input matrix X. With the similar notion of the stationary background in video 
surveillance, some portion of the variability may reflect common dynamic features (L) corresponding to 
triggering submovement even though the responses of each neuron are corrupted by task-irrelevant neural 
responses (S) and may vary significantly across many trials. 

2) Gene Regulatory Network 

Recall equation (3) and consider Figure 2(c). In (3), the vector field (<?{.}) represents a single influence 
for a specific gene, yet this single influence can be propagated through the network over time. For 
example, when we inhibit Xj , the gene expression levels of other genes can be perturbed over time. If Xj 
is connected with only few genes, this perturbation may only affect a small fraction of the total number 
of gene expression levels. 

Similar to equation (6), we construct M using gene expression time series data with q different 
perturbations and/or different cell lines. Here, each row of M € R ?x ™ ,JV:r represents the vectorized time 
series gene expression x [t] G « nxN -r (n: the number of genes, Nt- the number of time points and q: 
the number of different perturbation conditions including the number of different cell lines) and different 
rows represent spatio-temporal responses of different perturbations or different cell lines. 



Downloaded from http://biorxiv.org/on September 18, 2014 



G 

Since time series gene expression results from integration of regulatory signals constrained by the gene 
regulatory network, the input matrix M may reflect common dynamic response corresponding to the 
characteristics of the network structure. Intuitively, in video surveillance, if someone stays motionlessly 
in all the frames, the RPCA algorithm discriminates him as a low rank component. Unless he moves, we 
could not see the background because he always blocks the background. Similarly, in order to extract 
common response of gene regulatory network exactly, we should perturb the entire network arbitrarily 
and uniformly. 

Results 

Disentangling the Low-rank and Sparse components 

In [9], Candes et al. discuss the identifiability issue. To make the problem meaningful, the low-rank 
component L must not be sparse. Another identifiability issue arises if the sparse matrix has low-rank. 
In many computer vision applications, practical low-rank and sparse separation gives visually appealing 
solutions. 

However, for neural activity data, only a small subset of the whole ensemble of neurons is active at 
any moment as shown in Figure 3(left). Since M is sparse, the low-rank component might be sparse. 
Also, for the pathway targeted therapies, since gene regulatory networks are known to be sparse, a large 
subset of the whole ensemble of genes might be deactivated at any moment. Moreover, the original 
distributions of the amplitude of individual neuronal activities or gene expressions are highly skewed. For 
example, neural activities often form very eccentric clusters shown in Figure 3(left); some neurons are 
highly activated (30-40 spikes/sec) but others typically have only a few spikes per second. Similarly, gene 
expressions form very eccentric clusters since each gene expression shows different scales in practice. 

These imply that practical low-rank and sparse separation seems to be ambiguous and might present 
a challenge to achieving biologically meaningful solutions in both neural activity analyses and gene 
knockout experiment data sets. To remedy this identifiability issue, we propose the RPCA-based method 
in conjunction with Random Projection (RP); RP can de-sparsity the input data set and make a highly 
eccentric distribution more spherical so it makes the singular vectors of the low-rank matrix reasonably 
distributed, (see Methods section: Random Projection (RP) and Identifiability for details) 

Numerical Example 

To illustrate the issue of identifiability and how RP can alleviate the issue, we consider a simple example: 
we generate a sparse low-rank input matrix X £ ]^50x2-io ^ _ ^ n = 2, Nt = 10) where the rank of X 
is 6 as shown in Figure SI (a). Note that in this example we chose the same dimension for the input X 
and Y (refer to (7) and (8), no dimension reduction). This is done so that \& G ]g mx ™ m equation (7) is 
invertible (we choose m = n and a nonsingular matrix \&), allowing us to compare the outputs of RPCA 
and RP-RPCA directly, as described below. Here, by using RP, we take advantage of de-sparsifying our 
input data and reducing the eccentric distribution. In general, choosing m < n makes Y much denser 
because information is compressed by RP. 

To evaluate the performance of separation into a low-rank and a sparse component, we add sparse 

Corruption fol X. ^-corruption X -|- ^corruption and ^corruption — ^■corruption'^- XR- -|- S corruption'^- 

where R = (~^f T <E> Ijv r ) ^ s the projection so Y corruption is the projected corrupted input ^corruption- To 
compare the performance of RP-RPCA with RPCA, we first decompose Y corruption into its low-rank and 
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sparse components. Then, we invert the projection: 

w- t rpca i nrpca 

^corruption T" (orginal RPCA) 

_ V T? _1 — CI rpca -L O r P ca \T>-l 

11 corruption** 1 V Y ' Y / 

A f rpca i crpca 

— Li + S (RP-RPCA) 

where we define L r P ca ± L^R" 1 and S r P ca 4 S^R" 1 . 

Figure 4 shows statistics of both RPCA and RP-RPCA (in which RPCA is applied to the matrix Y) as 
a function of the tuning parameter A in equation (4). In this example, A* = l/\/max(g, n ■ Nt) = 1/V50. 
Since our input is still sparse in this example, the rank of both L rpca , L r P ca is 15 for A* = 0.141 (rank(X) = 
6). If we choose A = 0.113 (discounting the penalty for sparse component), the ranks of L rpca , L rpc0 are 
approximately 6, which is the same as the rank of the original input X. With this choice of A, for RPCA 
we find that ||S r ' pca || is much bigger than the original corruption signal ||X cor . r?)Itt i 0 „ — X|| = ||S corrupt j on ||. 
On the other hand, for RP-RPCA, we have ||S rpco || w ||S corr „ ptio „||. Therefore, for RP-RPCA, the 
separation of the low-rank component and sparse component is close to the true solution; for the original 
RPCA, there is mis- identification in both low-rank and sparse components (more detailed information is 
provided in Figure S2). 

Application to Neural Data 

Figure 3(left) shows the recorded neural activity aligned with submovement onset. The aligned neural 
activity shows that the ratios between units' mean firing rates are fairly constant from the salient vertical 
striations in the plots and that temporal patterns exists across all the submovements. Also, as mentioned 
previously, the neural population activities are sparsely active (white color represents 0 spikes/sec) and 
show eccentric behavior; for example, some neurons have a much higher spiking rate than others. 

Figure 3(middle) (right) show the low-rank matrix from both RPCA and RP-RPCA respectively (for 
simple comparison, we choose m = n). Since X is sparse and has an eccentric distribution, the singular 
vectors may not be reasonably spread out. Applying RPCA directly to X would result in the low-rank 
component being composed of only highly modulated neural activity (middle). On the other hand, RP- 
RPCA can extract the low-rank component from a more distributed set of neural dimensions than RPCA 
alone can. Also, the result of RP-RPCA gives a more visually appealing solution. 

Since we extract neural features which represent common dynamic patterns across many experimental 
trials, we can use these features to detect and predict the onset of submovements. Here, we simply use 
the correlation between the extracted neural features and the neural signals. To accurately predict 
submovement onset times found by submovement decomposition, the correlation function should peak 
around the movement onset time. The following observations suggest the potential application of RP- 
RPCA to predict movement execution in a closed-loop Brain Machine Interface (BMI) system: 

• (observation 1) Figure 5(a) represents the receiver operating characteristic (ROC) curve of the 
prediction of submovement onset time. We can see that the overall prediction performance based 
on RP-RPCA is better than the performance based on RPCA; we can reduce the false positive rate 
while increasing the true positive rate. 

• (observation 2) Figure 5(b) shows the ROC curves of the prediction of submovement onset for 
different subjects or various tasks including center-out task and random-pursuit. This prediction 
could allow correction of movement execution errors in a closed-loop BMI system. 

Application to gene knockout experiments 

To test the proposed RP-RPCA algorithm, we consider gene knockout experiments using SKBR3 cell 
line [4] which has been used in studies of Human Epidermal Growth Factor Receptor2 (HER2) positive 
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breast cancer. We chose this data set because it has 16 perturbations using a single cell line and contains 
15 gene expressions with 4 time points as shown in Figure 6 (top row). The middle row represents the 
low-rank component and the bottom row represents the highly aberrant sparse component. In raw data 
(top row), nearly all treatments show differential responses. However, the low-rank component (middle 
row) can be categorized into approximately 3-4 subtype responses, and the sparse component (bottom 
row) shows genomic aberration-specific responses. 

Also, the following observations suggest mechanisms of response and resistance which may inform 
unanticipated biological insight. 

• (observation 1) mTOR inhibition (the second column in the bottom row) shows aberration re- 
sponses in DEPTOR, pHER3, IRS-1 and pAkt(308, 473). In [15], DEPTOR is identified as an 
mTOR-interacting protein whose expression is negatively regulated by mTORCl and mTORC2; 
Also, Peterson et al. found that DEPTOR overexpression suppresses S6K1 but it activates Akt by 
relieving feedback inhibition from mTORCl to PI3K signaling. Therefore, high DEPTOR expression 
is necessary to maintain PI3K and Akt activation and is consistent with the previous result [15]. 

• (observation 2) HER2 inhibition (the sixth column in the bottom row) results in aberration 
responses of HER3, pAkt(473) and DEPTOR. Figure S3 represents an abstract model of HER2 
overexpressed breast cancer by biological interpretation. Since high DEPTOR expression represents 
low mTORCl and mTORC2 [15], there are increasing activated HER3 and Akt by relieving inhibition 
according to this model. The more interesting fact is that PHLPP is known to dcphosphorylate 
SER473 in Akt (i.e., partially inactivating the kinase) which is captured in the sparse component 
pAkt(473). 

• (observation 3) S6K inhibition (the third column in the bottom row) results in aberration re- 
sponses of pAkt(473). Since S6K is located downstream of the Akt-TSC2-mT0RC pathway, S6K 
inhibition captures only activation of pAkt(473). 

• (observation4) PI3K inhibition (the 7th-llth columns in the bottom row) leads to increase phos- 
phorylation of MAPK. 

We separate the common response from the aberrant responses using the proposed method. Since ab- 
normal behaviors or different responses to external stimuli or different cell lines can be extracted from 
the information available in the data set, we could cluster data correctly and reveal biological meaningful 
patterns (see Methods section: Cluster Analysis for details). Figure 7 shows the clustered result 
using existing hierarchical clustering (raw data M, d xy in (9)) and the proposed method ([L S] , in 
(10)) respectively. We match the clustered results with graphical representation and our clustered result 
is more consistent with the known network structure and responses than the result of existing hierarchical 
clustering. 

Application to RPPA (Reverse Phase Protein Arrays) data set 

Breast cancers are comprised of distinct subtypes which may respond differently to pathway-targeted 
therapies; collections of breast cancer cell lines show differential responses across cell lines and show 
subtype-, pathway-, and genomic aberration-specific responses [8]. These observations suggest mecha- 
nisms of response and resistance which differ across cell lines. Here, we use a data set generated in the 
Gray Lab using Reverse Phase Protein Arrays (RPPA) from the Mills Lab [16] which presents a time 
course analysis on 11 cell lines (all HER2 amplified: 6 PI3K mutant, 5 PI3K wild-type) in response to 
Lapatinib, Akt inhibitor and combination of the two. The time course for RPPA is at 30min, lh, 2h, 4h, 
8h, 24h, 48h and 72h post-treatment. 

As shown in Figure 8(top row), Lapatinib treatment results in down-regulation of a variety of phos- 
phoproteins in the signaling pathway. From the raw data (M) or low-rank component (L), we can easily 
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observe down-regulation and slow-recovery of the levels of activation, but the levels of activation are 
higher in the PI3K mutation cell lines. Treatment with Akt inhibitor leads to down-regulation of pro- 
teins (downstream of Akt) in all HER2 amplified cell lines, although the amplitude of down-regulation 
is slightly less in cell lines with PI3K mutations. In the PI3K mutation cell lines, treatment with the 
combination of Lapatinib and Akt inhibitor leads to further down-regulation of the Akt signaling pathway 
but Akt levels are intermediate in comparison to those observed with inhibitor alone. Although these 
observations are still interesting, more interesting details might be in both the low-rank component L 
and the sparse component S: 

• (observation 1) In the PI3K mutation with applying both inhibitors, full inhibition of pS6RP is 
observed and these results show the synergistic effect of Lapatinib and Akt inhibitor (in the bottom 
row, low-rank component). 

• (observation 2) The main difference between wild-type and PI3K mutant is the response of pS6RP 
and p70S6K. For the wild-type cell lines, all treatments result in down-regulated pS6RP and p70S6K. 
However, for PI3K mutant cells, all treatments result in up-regulation pS6RP and p70S6K in the 
short-term (in the sparse component, red) and down- regulation in the long-term. Suppressing 
pS6RP relieves feedback inhibition and activates Akt. This difference makes PI3K mutation cells 
more resistant to HER2 inhibitors than their wild-type counterparts. This finding is not obvious 
when we take a look at the raw data. Furthermore, our method makes our finding more convincing 
not by visually searching M, but by finding these effect automatically by separating common 
response (L) and aberrant behavior (S) by solving (4). 

• (observation 3) BT474 shows aberrant behavior as shown in Figure S4. This mutation has been 
reported to confer weak oncogenicity, unlike the other PI3K mutations. 

Figure 9 shows the clustered result using existing hierarchical clustering and the proposed method respec- 
tively. Our clustered result is more robust and unaffected by different treatments due to the separation of 
the common responses. On the other hand, the clustered group based on existing hierarchical clustering 
changes across different treatments even though the characteristics of cell lines are not changing. 

Discussion 

Clustering and network inference are usually developed independently. For instance, until recently, most 
studies of gene regulatory network inference focus on a particular data set to identify the underlying 
graph structure, and apply the same method to other data sets and so on. Or, clustering methods are 
used on various data sets to find subgroups or classify them. However, we would argue that there are 
deep relationships between clustering and network inference and they can potentially cover each other's 
shortcomings. Spatio-temporal gene expression patterns result from both the network structure and 
the integration of regulatory signals through the network [17]. Moreover, by comparing gene expression 
levels in the various perturbation conditions, we might reveal the subtype graph structure and understand 
heterogeneity across various perturbations. 

In this paper, we demonstrate that our RPCA-based method helps to find distinct subtypes and 
classify data sets in a robust way. In order to interpret multi-dimensional spatio-temporal data sets, 
it is common to compare the responses over experiments and find differences by looking raw data. As 
the dimension of high-throughput data increases, this is not possible in practice. The proposed method 
provides a way to interpret multi-dimensional data sets. The low-rank representation provides the large- 
scale features and the sparse component shows the interesting details such as genomic aberration-specific 
responses. The intuition behind this is that one can recover the principal components of a data matrix 
even though a positive fraction of its entries are arbitrarily corrupted or a fraction of the entries are 
missing as well [9], 
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Also, although there is a wealth of literature describing canonical cell signaling networks, little is 
known about exactly how these networks operate in different cancer cells. Therefore, a possible extension 
of the proposed method is that once we extract common responses, we apply inference algorithms to 
identify the unified structure using these common responses. Or, we can also focus on individual sparse 
components to identify the heterogeneity of network structure across cells of different type. Advancing our 
understanding of how these networks are deregulated across cancer cells and different targeted therapies 
will ultimately lead to improve effectiveness of pathway-targeted therapies. 

Conclusion 

In this study, we develop a new method for clustering and analyzing multi-dimensional biological data. 
Wc illustrate how the proposed method can be useful to extract common event-related neural features 
across many experimental trials. Also, we show that the proposed method helps to find distinct subtypes 
and classify data sets in a robust way by separating common response and abnormal responses without 
any prior knowledge. We are currently applying our method to analyze and cluster RPPA data set of 
the HER2 positive breast cancer and trying to identify underlying graph structures. 

Methods 

Random Projection (RP) and Identifiability 
Random Projection(RP) 

Recent theoretical work has identified random projection as a promising dimensionality reduction tech- 
nique [19]. Projecting the data onto a random lower-dimensional subspace preserves the similarity of 
different data vectors, for example, the distances between the points are approximately preserved. Also, 
RP can reduce the dimension of data while keeping clusters of data points well-separated [19]. More- 
over, using RP is substantially less expensive to compute than using techniques such as PCA (Principal 
Component Analysis) because RP is data-independent. 

The idea of RP is that a small number of random linear projections can preserve key information. 
Theoretical work [19] [20] [21] [22] guarantees that with high probability, all pairwise Euclidean and 
geodesic distances between points on a low-dimensional manifold are well-preserved under the mapping 
'J : K" — > K m , m < n. Consider a linear signal model 

n 

y(t) = *x(t) = ^x 4 (^e]R m (7) 
i=i 

where iff = [-01 ip2 ■■■ VVt] is an to x n projection matrix whose elements are drawn randomly from 
independent identical distributions. First, note that the dimensionality of the data x is reduced since 
to < n. Also, if we define 3^ = [e^y 1 ^]; e^y 2 ^]; e^y'M] G Ri xNt where e, is TO-dimensional unit 
vector and Y = [y t y 2 ... y m ] , then Y T = (* ® Iat t )X t or Y = X(* T (g) I Nt ) where (g) represents 
the Kronecker product and In t € M. NtxNt is an identity matrix. 

In [19] , Dasgupta showed that even if the original distribution of data samples is highly skewed (having 
an ellipsoidal contour of high eccentricity), its projected counterparts will be more spherical. Since it is 
conceptually much easier to design algorithms for spherical clusters than ellipsoidal ones, this feature of 
random projection can simplify the separation into the low-rank and sparse components. Therefore, we 
can reduce the computational complexity of the non-smooth convex optimization, in particular l\ and 
nuclear norms minimization, used in RPCA 1 . 



iJVlany speedup methods were developed in optimization by avoiding large-scale SVD. In [23], Mu et al. demonstrated 
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Identifiability 

Suppose our input X in equation (6) can be decomposed as X = L + S = Yli=i CiUjV| + ^i^b* 
where Oi are the positive singular values, Uj £ M. qxl , v* £ E lxn ' WT are the left- and right-singular vectors 
of L, and represents the rank of the matrix L. ds is the number of sparse components in S, and 
&i £ M. qxl ,hi £ K lxn Ar T are sparse with only one nonzero entry respectively. By using RP, we have for 

Y, 

Y = X(* t ®Iat t ) =XR = LR + SR 

dh ds 

= 2 ^u*(R T v*r + J2 <WR T bi)* 

i=l i—l 
dL ds 

= ^ °» U «V* + ^2 Ai£li ^i ( 8 ) 

i=l i=l 

where we denote (\& T <E) 1n t ) by R. As we mentioned above, our input X is sparse, so the singular vectors 
of the low-rank matrix L might not be reasonably spread out. However, by using RP (multiplying by 
R), the singular vectors Vj of the resulting matrix become reasonably spread out. 



Cluster Analysis 
Overview: Dissimilarity 

Common measures of dissimilarity for data include Euclidean distance [13], \\x — y\\ = \/Y^i=ii x i ~ Ui) 2 
where x and y are p-vectors of measurements on the objects to be clustered. Also, Manhattan distance 
dxy = 52f=i \ x i ~ Di\ i s used, and the "1- correlation" distance is defined as follows 

, _ , _ , ELi(^ -x){yj -v) 

xv ~ Pxy $Xii*<-zn /2 \ELi(vi-vn' 3 ( ) 

The 1- correlation distance is bounded in [0, 2]. This dissimilarity is invariant to changes in location or scale 
of cither x or y. The 1- correlation dissimilarity can be related to the more familiar Euclidean distance: 
if x = 7= x ~ x = == and y = ===, then II i — h|| 2 = 2»(1 — p xv ). That is, squared Euclidean 

distance for standardized objects is proportional to the correlation of the original objects. For microarray 
data, the choice of a dissimilarity measure makes it a popular choice for biological applications. Changes 
in the average measurement level or range of measurement from one sample to the next are effectively 
removed by this dissimilarity. 



Missing data and corruption 

As we mentioned, in microarray data, missing data and corrupted data are quite common so in order 
to deal with missing values, one can modify clustering methods or impute a "complete" data set before 
clustering. For example, we consider highly-correlated signal xl — sin(t) + n± and = sin(f) + 
where t is time step and n\,n2 are Gaussian noise A/"(0, a). Now, we add a sparse corruption (xs) to the 
original signal (xl) as shown in Figure 10(a) and calculate the dissimilarity between x corr (= ij, + x$) 
and y CO rr(— Vl + 0). Even though we choose the (i-sparse corruption of xs where <i(<C p) is the number 
of nonzero component in xg, the correlation is degraded as shown in Figure 10(b) (left). Assuming that 

the power of projected matrix nuclear norm by reformulating RPCA and in [24], Zhou et al. demonstrated the effectiveness 
and the efficiency of Bilateral Random Projections. However, both methods consider a dense matrix X while in this paper 
we consider the case when the input matrix X is sparse. 
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we know the corruption signal xs and ys, we can decompose x corr , y CO rr as <p = [xl',%s] & R 2p and 
ip = \_Vl\ ys] G K 2p respectively. In Figure 10(b)(middle), the red square represents the corruption signal 
where ys = 0. Since corruption signal changes the mean and the variance, the correlation is still degraded 
in (b) (middle). We introduce 7 so that we allow different weighting factors for (xliVl) arid (xs,ys) 
respectively. For example, we choose small 7 for the corruption signal (xs,ys)- 

Therefore, in order to deal with corrupted signals and cluster them, we should separate the original 
signal and corruption signal first and then calculate the dissimilarity with adjusting weighting factor 
7. For a gene expression time series data set, when a gene is knocked out, systems are subjected to 
controlled perturbations and the broad gene expression levels of other genes are perturbed. We can 
reveal extra information by comparing gene expression levels in the perturbed system with those in the 
original system. Since abnormal behaviors or different responses to external stimuli or different cell lines 
can be extracted from the original data using the information available in the data set, we could cluster 
data and reveal biological meaningful patterns. 

Our approach: a new 1-correlation distance 



x* ' y a a 

We rewrite the "1- correlation 11 distance (9) as d xy = 1— — - ' where x, y € W, x = x—x-l p , y = y—y-l p 

\\ x \\ \\y\\ 



and l p = [l ... l] and consider the separation as follows: 



•''L 

xs 



€ R 2 p and ip 



€ M 2p where 



x = Xx. + xs, y = yi, + ys and the subscript L,S represent low-rank component and sparse component. 
We define "1-correlation" distance for (f>, ip as follows: 



— 1 — — 1 



) 2 ] 1/2 E^i(^-^) 2 ] 1 / 2 



(10) 



where 



2p 



d(fyib{= 1 — Pc/>ip) is as follows: 



and ip = y^j— 1 V'i — \v- The relation between d xy (= 1 — p xy ) and 



dxy — 1 



<j> ■ lp 



and d^ — 1 



1> 



where x — x — x ■ l p — [l p I p \ 



p-dimensional identity matrix, 



Xj^ 2 * 
x s - § ■ lp 



[Ip Ip] 



Ip Op 
Op l p 



1 

>- 0. 



Ip Ip 
Ip Ip 



i2p) = [Ip Ip] 4>,y= [Ip Ip] i>, Ip is 



4> = v xy 4>, ip = Vxyi), vj y v xy 



Ip Ip 
Ip Ip 



V2V XV h 0 



and T%iiP<H> = 1 ' V H> = 

Therefore, d xy uses the mixture of low-rank component and sparse component but d^ calculates the 
correlation based on the separation. Also, in order to adjust the weighting factor as shown in Figure 



10(b) (right), we simply denote V, 



Ip Op 



Op 7l p 

Lemma 1. If the sparse component is zero, d^ = d xy 



where 7 is a weighting factor. 



Proof. Since xs = 0 and ys = 0, we can simply consider <j>, ip as 
respectively and 7 = 1 



XL 



and Tp 



□ 
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For the disentanglement, we propose the RPCA-based (Robust Principal Component Analysis, [9]) 
method which uses the information available in the data set in order to identify similar expression 
patterns 2 . 
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Figures 

Figure Legends 

Figure 1. Multi-dimensional Spatiotemporal data where we consider various experiments with different 
perturbations, doses, mechanism, tasks, etc. 

Figure 2. Conceptual representation: (a) RPCA applied to computer vision. A typical example of video 
surveillance where the low-rank component represents the unchanging background and the sparse com- 
ponent represents the movements in the foreground, (b) RPCA applied to neural systems. The low-rank 
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component putatively represents (submovement relevant) neural signatures and the sparse component 
represents neural activity unrelated to submovement onset, (c) Collections of gene-knockout experiments 
and mutant-specific part representations (breast cancer signaling pathway) with wild-type, Lapatinib treat- 
ment, Akt inhibitor and mutant cell lines where solid black edges represent common network topology, and 
blue and red edges represent a single change of the network topology for perturbations or mutant cell lines. 

Figure 3. The low-rank matrices from both RPCA and RP-RPCA where X are input matrices and we 
choose m — n — 64 for the comparison (contrast represents activity of neuron, i.e., high contrast repre- 
sents highly modulated neural activity and white color represents zero neural activity), (left) raw-data 
(center) low-rank component using RPCA and (right) low-rank component using RP-RPCA. 

Figure 4. Statistics of a numerical example: we run RPCA for ^corruption and Y corruption (We added 
sparse corruption to X) . Left y-axis represents the norm of sparse component and the right y-axis shows 
the rank of L (more detailed information in Figure SI and S2). 

Figure 5. Receiver Operating Characteristic (ROC) curve of the prediction of submovement onset: (a) 
comparison between RPCA and RP-RPCA (target jumps task) (b) different monkeys or tasks where we 
prcfiltcrcd certain submovements with small amplitude in order to avoid artifacts of overfitting. 

Figure 6. Gene knockout experiments [4](16 perturbations x 15 gene expressions x 4 time points [0, 1, 48, 
72h]): (upper) raw data (middle) low-rank component and (lower) highly corrupted sparse component 
using threshold. 

Figure 7. Clustered group: (left) hierarchical cluster and (right) the proposed method. Both clustered 
results compare with graphical representation generated by M. Moasser. 

Figure 8. Heat maps showing average response based on both raw data and disentanglement result 
within subtype to targeted therapeutics: (l st column) HER2+/PI3K wild type, (2„d column) HER2+/PI3K 
mutant. Each column consists of average responses of raw RPPA, low-rank component and sparse com- 
ponent. Each row represents targeted therapeutics alone and in combination (LAP, AKTi, both). In the 
PI3K mutation, we can see up-regulation of S6 pS235, pS240 and p70S6K pS371 in the short-term (in 
the sparse component, red) (more detailed information for each cell line in Figure S4)- 

Figure 9. Clustered group using RPPA data set: (a,b,c) hierarchical cluster and (d,e,f) the proposed 
method. 

Figure 10. Simple example: (a) green solid line with circle (-o-) represents y CO rr(— Vl + 0) and blue 
solid line with circle (-•-) represents x CO rr{— xl + xs) where filled circle (•) represents corrupted data, 
unfilled circle (o) represents uncorrupted data (xl) and unfilled square (□) represents corruption signal 
(xg) (b) Xcorr-Vcorr plot with 1- correlation distance (d xy ) without modification(left), with disentangle- 
ment(middle), and with disentanglement/weighting factor 7. 

Figure SI. (a) (upper) Input matrix X and singular value decomposition (SVD) (X = U X S X V*). 
(lower) Randomly projected input matrix Y and SVD (Y = U y S y V*). Note that since rank(X)=6, 
U x G M 9X6 ,S a; G M 6x6 ,V* G l 6Xn,JVT . In order to show how well singular vectors are spread out, we 
show the absolute value of each component. White represents zero value, (b) RPCA results. We run 
RPCA for sparsely corrupted ^corruption, ^ corruption- (We added sparse corruption to X as shown in 
Figure S2.) Left y-axis represents the norm of X — L and the right y-axis shows the rank of L. 
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Figure S2. The out of RPCA and RP-RPCA with two different A values: (a) For A = 0.113, both L r P ca 
and L rp_rpca have rank 6 (« rank(X)) as shown in Figure 4(b). There is a big difference between S rpca 
and the constructed corrupted signal (X — X corr ) (b) For A* = 0.141, S rp_rpca is close to X — X corr but the 
low-rank components are misidentified by both RPCA and RP-RPCA because both L rpca and L rp " rpca 
have rank 15. Therefore, for RP-RPCA, the separation of the low-rank component and sparse component 
is close to the true solution but for original RPCA, we have misidentification in both the low-rank and 
sparse components. We can easily see that S rpca shows characteristics of the low-rank component in 
Figure S2 (middle columns of each panel) . 

Figure S3. Abstract HER2 overexpressed breast cancer model. 

Figure S4. Separation result: (l s t column) raw data (2 n£ j column) low-rank component and (3 rc j col- 
umn) highly corrupted sparse component using threshold (Ml: H1047R (kinase domain mutation), M2: 
E545K (helical domain mutation), and M3: K111N mutation in PIK3CA). 
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Figure 1. Multi-dimensional Spatiotemporal data where we consider various experiments with 
different perturbations, doses, mechanism, tasks, etc. 
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Figure 3. The low-rank matrices from both RPCA and RP-RPCA where X are input matrices and we 
choose m — n = 64 for the comparison (contrast represents activity of neuron, i.e., high contrast 
represents highly modulated neural activity and white color represents zero neural activity), (left) 
raw-data (center) low-rank component using RPCA and (right) low-rank component using RP-RPCA. 
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Figure 4. Statistics of a numerical example: we run RPCA for ^corruption and ¥ corruption (We added 
sparse corruption to X) . Left y-axis represents the norm of sparse component and the right y-axis shows 
the rank of L (more detailed information in Figure SI and S2). 
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Figure 5. Receiver Operating Characteristic (ROC) curve of the prediction of submovement onset: (a) 
comparison between RPCA and RP-RPCA (target jumps task) (b) different monkeys or tasks where we 
prcfiltcrcd certain submovements with small amplitude in order to avoid artifacts of overfitting. 
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Figure 6. Gene knockout experiments [4] (16 perturbations x 15 gene expressionsx4 time points [0, 1, 
48, 72h]): (upper) raw data (middle) low-rank component and (lower) highly corrupted sparse 
component using threshold. 
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Figure 7. Clustered group: (left) hierarchical cluster and (right) the proposed method. Both clustered 
results compare with graphical representation generated by M. Moasser. 
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Figure 8. Heat maps showing average response based on both raw data and disentanglement result 
within subtype to targeted therapeutics: (l s t column) HER2+/PI3K wild type, (2 n d column) 
HER2+/PI3K mutant. Each column consists of average responses of raw RPPA, low-rank component 
and sparse component. Each row represents targeted therapeutics alone and in combination (LAP, AKTi, 
both). In the PI3K mutation, we can see up-regulation of S6 pS235, pS240 and p70S6K pS371 in the 
short-term (in the sparse component, red) (more detailed information for each cell line in Figure S4)- 
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(b) Akt Inhibitor 




(c) LAP + Akt Inhibitor 
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Figure 9. Clustered group using RPPA data set: (a,b,c) hierarchical cluster and (d,e,f) the proposed 
method. 
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(a) x corr ,y corr S K p , and ci-sparse corruption x s where (p = 20, d = 6) 
d = 0.43 d = 0.36 d= 0.05 





-2-1 0 1 2 

(b) d^: cyan dot line represents a linear fit of uncorruptcd data {xdVl) an d re d solid line represents that of corrupted 
data {x corr ,y C orr) or 



Figure 10. Simple example: (a) green solid line with circle (-o-) represents y C orr(— Ul + 0) and blue 
solid line with circle (-•-) represents x corr (— xl + xs) where filled circle (•) represents corrupted data, 
unfilled circle (o) represents uncorrupted data (xl) and unfilled square (□) represents corruption signal 
(xs) (b) x corr -y corr plot with 1- correlation distance (d xy ) without modification(lcft), with 
disentanglement(middle), and with disentanglement/weighting factor 7. 



Downloaded from http://biorxiv.org/on September 18, 2014 



30 



Supplementary Information 
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Figure SI. (a) (upper) Input matrix X and singular value decomposition (SVD) (X = U X S X V*). 
(lower) Randomly projected input matrix Y and SVD (Y = U y S y V*). Note that since rank(X)=6, 
U x € R qx6 , T, x e R 6xe , V* € K6xn-iVr_ in orc |er to show how well singular vectors are spread out, we 
show the absolute value of each component. White represents zero value, (b) RPCA results. We run 
RPCA for sparsely corrupted ^corruption, Y corruption- (We added sparse corruption to X as shown in 
Figure S2.) Left y-axis represents the norm of X — L and the right y-axis shows the rank of L. 
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Figure S2. The out of RPCA and RP-RPCA with two different A values: (a) For A = 0.113, both 
L rpca and L rp_rpca have rank 6 (~ rank(X)) as shown in Figure 4(b). There is a big difference between 
grpca and the cons tructed corrupted signal (X - X corr ) (b) For A* = 0.141, S rp - rpca is close to X - X corr 
but the low-rank components are misidentified by both RPCA and RP-RPCA because both L rpca and 
jyp-rpca nave ran k 15 Therefore, for RP-RPCA, the separation of the low-rank component and sparse 
component is close to the true solution but for original RPCA, we have misidentification in both the 
low-rank and sparse components. We can easily see that S rpca shows characteristics of the low-rank 
component in Figure S2 (middle columns of each panel). 
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Figure S3. Abstract HER2 overexpressed breast cancer model. 
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Figure S4. Separation result: (l st column) raw data (2 n d column) low-rank component and (3 r( 2 
column) highly corrupted sparse component using threshold (Ml: H1047R (kinase domain mutation), 
M2: E545K (helical domain mutation), and M3: K111N mutation in PIK3CA). 



